Materials+ML Workshop Day 8¶


Content for today:¶

  • Regression Models Review

    • Linear Regression
    • High-dimensional Embeddings
    • Kernel Machines
  • Unsupervised Learning

    • Feature Selection
    • Dimensionality reduction
    • Clustering
    • Distribution Estimation
    • Anomaly Detection
  • Application: Classifying Superconductors

    • Application of unsupervised methods

The Workshop Online Book:¶

https://cburdine.github.io/materials-ml-workshop/¶

  • A link to our Workshop YouTube playlist is now available

Tentative Week 2 Schedule¶

| Session | Date | Content |
| --- | --- | --- |
| Day 6 | 06/16/2025 (2:00-4:00 PM) | Introduction to ML, Supervised Learning |
| Day 7 | 06/17/2025 (2:00-4:00 PM) | Advanced Regression Models |
| Day 8 | 06/18/2025 (2:00-5:00 PM) | Unsupervised Learning, Neural Networks |
| Day 9 | 06/19/2025 (2:00-4:00 PM) | Neural Networks |
| Day 10 | 06/20/2025 (2:00-5:00 PM) | Neural Networks, Advanced Applications |

Questions¶

  • Regression Models
    • Linear Regression
    • High-dimensional Embeddings
    • Kernel Machines
    • Supervised Learning (in general)

Review: Day 7¶

Multivariate Linear Regression¶

  • Multivariate linear regression is a type of regression model that estimates a label as a linear combination of features:
$$\hat{y} = f(\mathbf{x}) = w_0 + \sum_{i=1}^D w_i x_i$$

We can re-write the linear regression model in vector form:

  • Let
$$\underline{\mathbf{x}} = \begin{bmatrix} 1 & x_1 & x_2 & \dots & x_D \end{bmatrix}^T$$

($\mathbf{x}$ padded with a 1)

  • Let $$\mathbf{w} = \begin{bmatrix} w_0 & w_1 & w_2 & \dots & w_D \end{bmatrix}^T$$

(the weight vector)

  • $f(\mathbf{x})$ is just the inner product (i.e. dot product) of these two vectors:
$$\hat{y} = f(\mathbf{x}) = \underline{\mathbf{x}}^T\mathbf{w}$$

Closed Form Solution:¶

  • Multivariate Linear Regression:
$$\mathbf{w} = \mathbf{X}^+\mathbf{y}$$
  • Above, $\mathbf{X}$ is the design matrix whose rows are the padded feature vectors $\underline{\mathbf{x}}_n^T$, $\mathbf{y}$ is the vector of labels, and $\mathbf{X}^+$ denotes the Moore-Penrose inverse (sometimes called the pseudo-inverse) of $\mathbf{X}$.
  • If the dataset is large enough that $\mathbf{X}$ has linearly independent columns, the pseudo-inverse (and hence the optimal weights) can be computed as:
$$\mathbf{w} = \mathbf{X}^{+}\mathbf{y} = \left( (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\right)\mathbf{y}$$
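
A minimal sketch of this closed-form solution in Python (NumPy on made-up data; the dataset and variable names here are purely illustrative):

```python
import numpy as np

# toy dataset (illustrative only): N = 100 samples, D = 3 features
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = 2.0 + X_raw @ np.array([1.5, -0.7, 0.3]) + 0.1 * rng.normal(size=100)

# build the design matrix: each row is a feature vector padded with a leading 1
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# w = X^+ y  (Moore-Penrose pseudo-inverse)
w = np.linalg.pinv(X) @ y
print(w)  # w[0] ~ bias w_0, w[1:] ~ w_1, ..., w_D
```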

High-Dimensional Embeddings¶

  • Often, the trends of $y$ with respect to $\mathbf{x}$ are non-linear, so multivariate linear regression may fail to give good results.

  • One way of handling this is by embedding the data in a higher-dimensional space using many different non-linear functions:

$$\phi_j(\mathbf{x}) : \mathbb{R}^{D} \rightarrow \mathbb{R}\qquad (j = 1, 2, ..., D_{emb})$$

(The $\phi_j$ are nonlinear functions, and $D_{emb}$ is the embedding dimension)

$$\hat{y} = f(\mathbf{x}) = w_0 + \sum_{j=1}^{D_{emb}} w_j \phi_j(\mathbf{x})$$
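
As a sketch of a high-dimensional embedding (using polynomial functions for the $\phi_j$ and made-up 1D data; any set of nonlinear embedding functions could be substituted):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# toy 1D dataset (illustrative only): y varies nonlinearly with x
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(80, 1))
y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=80)

# embed x into a higher-dimensional space: phi_j(x) = x^j, j = 1..5
phi = PolynomialFeatures(degree=5, include_bias=False)
X_emb = phi.fit_transform(x)              # shape: (N, D_emb)

# ordinary linear regression on the embedded features
model = LinearRegression().fit(X_emb, y)
print(model.intercept_, model.coef_)      # w_0 and w_1, ..., w_{D_emb}
```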

Underfitting and Overfitting¶

  • Finding the best fit of a model requires striking a balance between underfitting and overfitting the data.
  • A model underfits the data if it has too few degrees of freedom to capture the underlying trend.
  • A model overfits the data if it has so many degrees of freedom that it fits noise in the training data and fails to generalize outside of it.

Polynomial Regression Example:


Regularization:¶

  • To reduce overfitting, we apply regularization.
  • Usually, a penalty term is added to the overall model loss function:

    $$\text{ Penalty Term } = \lambda \sum_{j} w_j^2 = \lambda(\mathbf{w}^T\mathbf{w})$$

  • The parameter $\lambda$ is called the regularization parameter

    • as $\lambda$ increases, more regularization is applied.
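
A brief sketch of how this penalty is applied in practice (ridge regression in scikit-learn, where the regularization parameter $\lambda$ is called `alpha`; the data below is made up):

```python
import numpy as np
from sklearn.linear_model import Ridge

# toy dataset (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

# larger alpha -> stronger penalty on the weights -> smaller weight norm
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.linalg.norm(model.coef_))
```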

Application: Band Gap Prediction¶

  • Finishing up yesterday's application

Today's Content:¶

Unsupervised Learning

  • Feature Selection
  • Dimensionality reduction
  • Clustering
  • Distribution Estimation
  • Anomaly Detection

Unsupervised Learning Models:¶

  • Models applied to unlabeled data with the goal of discovering trends and patterns, extracting features, or finding relationships within the data.

  • Unsupervised models deal with datasets of features only (just $\mathbf{x}$, not $(\mathbf{x},y)$ pairs).

Types of Unsupervised Learning Problems¶


Feature Selection and Dimensionality Reduction¶

  • Determines which features are the most "meaningful" in explaining how the data is distributed
  • Sometimes we work with high-dimensional data that is very sparse

  • Reducing the dimensionality of the data might be necessary

    • Reduces computational complexity
    • Eliminates unnecessary (or redundant) features
    • Can even improve model accuracy

The Importance of Dimensionality¶

  • Dimensionality is an important concept in materials science.
    • The dimensionality of a material affects its properties
  • Much like materials, the dimensionality of a dataset can say a lot about its properties:
    • How complex is the data?
    • Are there any "meaningless" or "redundant" features in the dataset?
    • Does the data have fewer degrees of freedom than features?
  • Sometimes, data can be confined to some low-dimensional manifold embedded in a higher-dimensional space.

Example: The "Swiss Roll" manifold


Review: The Covariance Matrix¶

  • The Covariance Matrix describes the variance of data in more than one dimension:
$$\mathbf{\Sigma} = \begin{bmatrix} \sigma_{1}^2 & \sigma_{12} & \dots & \sigma_{1d} \\ \sigma_{21} & \sigma_{2}^2 & \dots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \dots & \sigma_{d}^2 \end{bmatrix}$$
  • $\Sigma_{ii} = \sigma_i^2$: variance in dimension $i$
  • $\Sigma_{ij} = \sigma_{ij}$: covariance between dimensions $i$ and $j$
$$\Sigma_{ij} = \frac{1}{N} \sum_{n=1}^N ((\mathbf{x}_n)_i - \mu_i)((\mathbf{x}_n)_j - \mu_j)$$
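
A quick sketch of this computation with NumPy (made-up data; note that `np.cov` with `bias=True` uses the same $1/N$ normalization as the formula above):

```python
import numpy as np

# X: (N, d) array of feature vectors (illustrative data)
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / X.shape[0]   # the 1/N covariance formula above

# np.cov(..., bias=True) computes the same matrix
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))
```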

The Correlation Matrix:¶

  • Recall that it is generally a good idea to standardize our data:
$$\mathbf{x} \mapsto \mathbf{z}:\quad z_i = \frac{x_i - \mu_i}{\sigma_i}$$
  • The correlation matrix (denoted $\bar{\Sigma}$) is the covariance matrix of the standardized data:
$$ \bar{\Sigma} = \frac{1}{N} \sum_{n=1}^N \mathbf{z}_n\mathbf{z}_n^T $$
  • The entries of the correlation matrix (in terms of the original data) are:
$$\bar{\Sigma}_{ij} = \frac{1}{N} \sum_{n=1}^N \frac{((\mathbf{x}_n)_i - \mu_i)((\mathbf{x}_n)_j - \mu_j)}{\sigma_i\sigma_j}$$
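
A corresponding sketch for the correlation matrix (again on made-up data; `np.corrcoef` gives the same result directly):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))            # illustrative (N, d) data

# standardize each feature, then take the covariance of the z-scores
Z = (X - X.mean(axis=0)) / X.std(axis=0)
Sigma_bar = Z.T @ Z / Z.shape[0]

# np.corrcoef computes the correlation matrix directly from X
print(np.allclose(Sigma_bar, np.corrcoef(X, rowvar=False)))
```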

Interpreting the Correlation Matrix¶

$$\bar{\Sigma}_{ij} = \frac{1}{N} \sum_{n=1}^N \frac{((\mathbf{x}_n)_i - \mu_i)((\mathbf{x}_n)_j - \mu_j)}{\sigma_i\sigma_j}$$
  • The diagonal of the correlation matrix consists of $1$s. (Why?)
  • The off-diagonal components describe the strength of correlation between feature dimensions $i$ and $j$
    • Positive values: positive correlation
    • Negative values: negative correlation
    • Zero values: no correlation

Principal Components Analysis (PCA)¶

  • The eigenvectors of the correlation matrix are called principal components.

  • The associated eigenvalues describe the proportion of the data variance in the direction of each principal component.

$$\bar{\Sigma} = P D P^{T}$$
  • $D$: Diagonal matrix (eigenvalues along diagonal)
  • $P$: Principal component matrix (columns are principal components)
  • Since $\bar{\Sigma}$ is symmetric, the principal components (the columns of $P$) are mutually orthogonal, so $P^{T} = P^{-1}$.
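
A minimal sketch of extracting the principal components with NumPy (made-up data; `np.linalg.eigh` is used since $\bar{\Sigma}$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))                 # illustrative (N, d) data
Sigma_bar = np.corrcoef(X, rowvar=False)      # correlation matrix

# eigendecomposition: Sigma_bar = P D P^T
eigvals, P = np.linalg.eigh(Sigma_bar)        # columns of P = principal components
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, P = eigvals[order], P[:, order]

# proportion of the total variance along each principal component
print(eigvals / eigvals.sum())
```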

Dimension reduction with PCA¶

We can project our (standardized) data onto the first $n$ principal components (those with the largest eigenvalues) to reduce the dimensionality of the data while still keeping most of the variance:

$$\mathbf{z} \mapsto \mathbf{u} = \begin{bmatrix} \mathbf{z}^T\mathbf{p}^{(1)} \\ \mathbf{z}^T\mathbf{p}^{(2)} \\ \vdots \\ \mathbf{z}^T\mathbf{p}^{(n)} \\ \end{bmatrix}$$
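
In practice, this projection is typically done with `sklearn.decomposition.PCA`; here is a sketch on made-up data, keeping $n = 2$ components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(6).normal(size=(300, 5))   # illustrative data

# standardize, then project onto the first n = 2 principal components
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
U = pca.fit_transform(Z)                              # shape: (N, 2)

print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```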

Exercise: Feature Selection and Dimensionality Reduction¶

  • Applying PCA

Clustering and Distribution Estimation¶

  • Clustering methods allow us to identify dense groupings of data.

  • Distribution Estimation allows us to estimate the probability distribution of the data.

K-Means Clustering:¶

  • $k$-means is a popular clustering algorithm that identifies the center points of a specified number of clusters $k$.

  • These center points are called centroids.

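
A quick sketch with scikit-learn's `KMeans` (made-up 2D data, $k = 3$):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(7).normal(size=(300, 2))   # illustrative 2D data

# fit k-means with k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the centroids
print(kmeans.labels_[:10])       # cluster assignments of the first 10 points
```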

Kernel Density Estimation:¶

  • Kernel Density Estimation (KDE) estimates the probability distribution of an entire dataset.

  • KDE estimates the distribution as a sum of multivariate Gaussian "bumps" at the position of each data point.

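
A brief sketch with scikit-learn's `KernelDensity` (made-up 2D data; `bandwidth` sets the width of the Gaussian "bumps"):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.default_rng(8).normal(size=(500, 2))   # illustrative 2D data

# fit a Gaussian KDE
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

# score_samples returns log-densities; exponentiate to get the estimated density
print(np.exp(kde.score_samples(X[:5])))
```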

Gaussian Mixture Model¶

  • A Gaussian Mixture Model (GMM) performs both clustering and distribution estimation simultaneously.

  • GMMs fit a mixture of multivariate normal distributions to the data.

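
A brief sketch with scikit-learn's `GaussianMixture` (made-up 2D data, 3 components):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(9).normal(size=(500, 2))   # illustrative 2D data

# fit a mixture of 3 multivariate Gaussians
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(gmm.means_)                # cluster centers (means of each Gaussian)
print(gmm.predict(X[:5]))        # hard cluster assignments
print(gmm.score_samples(X[:5]))  # log-density estimates (distribution estimation)
```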

Exercise: Clustering and Distribution Estimation¶

  • Estimating Density of States with KDE

Application: Classifying Superconductors¶

  • Exploring the distribution of superconducting materials

Recommended Reading:¶

  • Neural Networks

If possible, try to do the exercises. Bring your questions to our next meeting on Friday.

  • No meeting tomorrow due to the Juneteenth holiday (June 19th)